All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media

نویسندگان

  • Jasabanta Patro
  • Bidisha Samanta
  • Saurabh Singh
  • Abhipsa Basu
  • Prithwish Mukherjee
  • Monojit Choudhury
  • Animesh Mukherjee
چکیده

In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman correlation coefficient values, our methods perform more than two times better (nearly 0.62) in predicting the borrowing likeliness compared to the best performing baseline (nearly 0.26) reported in literature. Based on this likeliness estimate we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88 percent of cases the annotators felt that the foreign language tag should be replaced by native language tag, thus indicating a huge scope for improvement of automatic language identification systems.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media

In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman’s correlation values, our methods perform more than two times better (∼ 0.62) in predicting the borrowing likeliness compared to the best performing baseline (∼ 0.26) reported in literature. Based on this likeliness estimate w...

متن کامل

Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media

Code-mixing or code-switching refer to the phenomenon of effortless and natural switching between two or more languages in a single conversation, sometimes even in a single utterance, by multilingual speakers. However, use a foreign word in a language does not necessarily mean that the speaker is code-switching, because often languages borrow lexical items from other languages. If a word is bor...

متن کامل

Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text

Language identification at the document level has been considered an almost solved problem in some application areas, but language detectors fail in the social media context due to phenomena such as utterance internal code-switching, lexical borrowings, and phonetic typing; all implying that language identification in social media has to be carried out at the word level. The paper reports a stu...

متن کامل

Code Mixing: A Challenge for Language Identification in the Language of Social Media

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, wh...

متن کامل

The Presence and Influence of English in the Portuguese Financial Media

As the lingua franca of the 21st century, English has become the main language for intercultural communication for those wanting to embrace globalization. In Portugal, it is the second language of most public and private domains influencing its culture and discourses. Language contact situations transform languages by the incorporations they make from other languages and Portugal has...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1707.08446  شماره 

صفحات  -

تاریخ انتشار 2017